Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret
The stochastic multi-armed bandit setting has recently been studied in the non-stationary regime, where the mean payoff of each action is a non-decreasing function of the number of rounds passed since it was last played. This model captures natural behavioral aspects of users which crucially determine the performance of recommendation platforms, ad placement systems, and more. Even assuming prior knowledge of the mean payoff functions, computing an optimal plan in the above model is NP-hard, while the state of the art is a 1/4-approximation algorithm for the case where at most one arm can be played per round. We first focus on the setting where the mean payoff functions are known. In this setting, we significantly improve the best-known guarantees for the planning problem by developing a polynomial-time (1 - 1/e)-approximation algorithm (asymptotically and in expectation), based on a novel combination of randomized LP rounding and a time-correlated (interleaved) scheduling method.
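To make the recharging-payoff model concrete, here is a minimal Python sketch of an instance with a myopic greedy planner. The recharging curve and all names are hypothetical, and the greedy baseline is not the paper's LP-rounding algorithm:

```python
def make_recharge_fn(cap, rate):
    # Hypothetical concave recharging curve: the mean payoff is a
    # non-decreasing function of the delay d (rounds since the arm
    # was last played) and saturates at `cap`.
    return lambda d: cap * (1.0 - (1.0 - rate) ** d)

def greedy_plan(recharge_fns, horizon):
    # Myopic baseline, NOT the paper's (1 - 1/e)-approximation:
    # each round, play the arm whose current mean payoff is largest.
    k = len(recharge_fns)
    delay = [1] * k  # rounds since each arm was last played
    plan, total = [], 0.0
    for _ in range(horizon):
        means = [f(d) for f, d in zip(recharge_fns, delay)]
        arm = max(range(k), key=lambda i: means[i])
        plan.append(arm)
        total += means[arm]
        delay = [1 if i == arm else d + 1 for i, d in enumerate(delay)]
    return plan, total
```

Even this crude planner exhibits the model's key tension: playing the best arm depletes it, so good schedules interleave arms to let payoffs recharge.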
Non-stationary Bandits with Knapsacks
In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, receiving a reward and consuming a certain amount of each of multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment.
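The BwK interaction protocol can be sketched in a few lines of Python. This is an illustrative, noiseless loop (not the paper's algorithm): each play earns a reward and drains each resource, and play stops once any budget would be exceeded:

```python
def run_bwk(mean_rewards, mean_costs, budgets, horizon, policy):
    # Minimal bandits-with-knapsacks loop (illustrative sketch):
    # `policy(t)` picks an arm; mean_costs[arm][j] is the amount of
    # resource j that one play of that arm consumes.
    remaining = list(budgets)
    total = 0.0
    for t in range(horizon):
        arm = policy(t)
        if any(c > r for c, r in zip(mean_costs[arm], remaining)):
            break  # a knapsack constraint binds: stop playing
        total += mean_rewards[arm]
        remaining = [r - c for r, c in zip(remaining, mean_costs[arm])]
    return total, remaining
```

The point of the sketch is that the effective horizon is the smaller of the time budget and the resource budgets, which is exactly what makes BwK harder than plain MAB under non-stationarity.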
Non-Stationary Bandits with Auto-Regressive Temporal Dependency
Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.
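A hypothetical AR(1) reward process illustrates the kind of temporal dependency the framework above models; the function name and parameters are this sketch's own, not the paper's:

```python
import random

def ar1_rewards(phi, noise_sd, horizon, x0=0.0, seed=0):
    # Hypothetical AR(1) reward process for a single arm:
    # X_t = phi * X_{t-1} + eps_t,  eps_t ~ N(0, noise_sd^2).
    # With |phi| < 1 the process mean-reverts, so information about
    # the current reward level loses usefulness at a rate set by phi:
    # this is what an alternation/restarting mechanism must track.
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(horizon):
        x = phi * x + rng.gauss(0.0, noise_sd)
        path.append(x)
    return path
```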
Adaptive Smooth Non-Stationary Bandits
We study a $K$-armed non-stationary bandit model where rewards change smoothly, as captured by Hölder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a Hölder exponent $\beta$ and coefficient $\lambda$. While various sub-cases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all $K,\beta,\lambda$. Next, we show this optimal dynamic regret can be attained adaptively, without knowledge of $\beta,\lambda$. By contrast, even with parameter knowledge, upper bounds were previously known only for the limited regimes $\beta\leq 1$ and $\beta=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al., 2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates have long been known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $\sqrt{T}$. While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our new gap-dependent rates reveal new optimistic regimes of non-stationarity where even the logarithmic bounds are pessimistic. We show our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth Hölder class model.
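For intuition on the smoothness assumption, here is a small sketch that checks a discretized mean-reward path against the Hölder condition for exponent $\beta \leq 1$ (for $\beta > 1$ the class instead constrains higher derivatives); all names are this sketch's own:

```python
def holder_bound(beta, lam, t, s):
    # Hölder drift bound for beta <= 1:
    # |mu(t) - mu(s)| <= lam * |t - s|**beta.
    return lam * abs(t - s) ** beta

def is_holder_smooth(mu, beta, lam, horizon):
    # Check a mean-reward function mu on a grid of horizon+1 points
    # in [0, 1] against the beta <= 1 Hölder condition (illustrative).
    vals = [mu(t / horizon) for t in range(horizon + 1)]
    return all(
        abs(vals[t] - vals[s])
        <= holder_bound(beta, lam, t / horizon, s / horizon) + 1e-12
        for t in range(horizon + 1) for s in range(horizon + 1)
    )
```

Larger $\lambda$ or smaller $\beta$ admits faster-moving reward paths, which is why the minimax dynamic regret rate must depend on both parameters.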
A Definition of Non-Stationary Bandits
Yueyang Liu, Xu Kuang, Benjamin Van Roy
Although the subject of non-stationary bandit learning has attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones. Prior work has characterized non-stationary bandits as bandits for which the reward distribution changes over time. We demonstrate that this definition can ambiguously classify the same bandit as both stationary and non-stationary; this ambiguity arises from the existing definition's dependence on the latent sequence of reward distributions. Moreover, the definition has given rise to two widely used notions of regret: the dynamic regret and the weak regret. These notions are not indicative of qualitative agent performance in some bandits. Additionally, this definition of non-stationary bandits has led to the design of agents that explore excessively. We introduce a formal definition of non-stationary bandits that resolves these issues. Our new definition provides a unified approach, applicable seamlessly to both Bayesian and frequentist formulations of bandits. Furthermore, our definition ensures consistent classification of two bandits offering agents indistinguishable experiences, categorizing them as either both stationary or both non-stationary. This advancement provides a more robust framework for non-stationary bandit learning.
Non-Stationary Bandit Learning via Predictive Sampling
Yueyang Liu, Xu Kuang, Benjamin Van Roy
Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
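One simple way to see the issue the abstract raises is Bernoulli Thompson sampling with exponential discounting, so that stale evidence fades. This downweighting is illustrative only; it is NOT the paper's predictive-sampling algorithm, and all names here are hypothetical:

```python
import random

def ts_step(counts, rng):
    # One Bernoulli Thompson-sampling step: sample a mean for each
    # arm from its Beta posterior and play the argmax.
    # counts[i] = [pseudo-successes, pseudo-failures] for arm i.
    samples = [rng.betavariate(s + 1.0, f + 1.0) for s, f in counts]
    return max(range(len(counts)), key=lambda i: samples[i])

def discounted_update(counts, arm, reward, gamma):
    # Exponentially discount all evidence so old observations lose
    # influence, then record the new outcome. Vanilla Thompson
    # sampling corresponds to gamma = 1 (no forgetting).
    for c in counts:
        c[0] *= gamma
        c[1] *= gamma
    counts[arm][0 if reward else 1] += 1.0
```

With gamma = 1 the posterior never forgets, which is exactly the failure mode in drifting environments; gamma < 1 keeps posteriors responsive at the cost of extra exploration.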
On Slowly-varying Non-stationary Bandits
Ramakrishnan Krishnamurthy, Aditya Gopalan
Reinforcement learning, and specifically bandit optimization, in dynamically changing environments has remained an active topic of study in machine learning. A variety of non-stationary bandit settings have been studied incorporating a range of structural assumptions.

At one end are classical stochastic models such as restless bandits [Whittle, 1988], where the state of the arms governs the bandit problem at any instant, but the transitions between these problems (states) follow probabilistic dynamics. At the other extreme are settings with non-stochastic and arbitrarily changing rewards such as prediction with expert advice (and the EXP3 algorithm) [Cesa-Bianchi and Lugosi, 2006; Auer et al., 2002]. In between these extremes lie settings of changing environments where the adversary (environment) is assumed to be limited in its ability to change the rewards, i.e., a structural constraint is put on the amount of change in the rewards across time. These include the abrupt change (or switching experts) model [Garivier and Moulines, 2011], where at most k arbitrary changes to the reward distributions are allowed in the entire time horizon, and the variation-budgeted (drifting) change model [Besbes et al., 2014], in which the total magnitude of changes (of rewards) across successive time steps is constrained to be within an overall budget.

In this paper, we focus on slowly-varying bandits - a different and arguably commonly encountered, yet less studied, model of non-stationary bandits. In this setting, the arms are allowed to change arbitrarily over time as long as the amount of change in their mean rewards between two successive time steps is bounded uniformly across the horizon.
Many real-world settings naturally involve observables whose distributions are 'smooth' over time, in the sense that their instantaneous rate of change is not too large, e.g., slowly drifting distributions in natural language tasks [Lu et al., 2020], data from physical transducers (position, velocity, power, temperature, chemical concentration), and slowly fading wireless
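The slowly-varying constraint described above can be made concrete with a short sketch: a clipped random walk whose per-step drift never exceeds the bound delta (names and the walk itself are this sketch's own, not the paper's):

```python
import random

def slowly_varying_means(k, horizon, delta, seed=0):
    # Generate k mean-reward paths in [0, 1] satisfying the
    # slowly-varying constraint: |mu_i(t+1) - mu_i(t)| <= delta for
    # every arm i and step t. Clipping to [0, 1] can only shrink a
    # step, so the per-step bound is preserved.
    rng = random.Random(seed)
    paths = [[rng.random()] for _ in range(k)]
    for _ in range(horizon - 1):
        for path in paths:
            step = rng.uniform(-delta, delta)
            path.append(min(1.0, max(0.0, path[-1] + step)))
    return paths
```

Unlike the switching or variation-budget models, nothing limits the total change over the horizon here; only the instantaneous rate is bounded, which is what the slowly-varying analysis exploits.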